Install the rmarkdown package
This is a latex equation, \(E = mc^{2}\)
This is a inline code print("this is an inline code")
Set echo = TRUE if you want to include source code in the output
For Spanish beginners see: https://markdown.es/sintaxis-markdown/
R Markdown Cheat Sheet: https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
Other good one: https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf
Latex symbols: https://en.wikibooks.org/wiki/LaTeX/Mathematics
Binary or binomial classification is the task of classifying the elements of a given set into two groups (predicting which group each one belongs to) (Wikipedia - Binary Classification (2017)).
An example:
In statistics, logistic regression, or logit regression, or logit model[1] is a regression model where the dependent variable (DV) is categorical (Wikipedia - Logistic Regression (2017)).
Find the maths here:
Given \(x\) and \(y\) find \(\hat{y}\) such as \(\hat{y} = P(y = 1 | x)\) where \(0 \leq \hat{y} \leq 1\)
We need to find \(w \in \mathbb{R}^n\) and \(b \in \mathbb{R}\) where \(\hat{y} = \sigma(w^Tx + b) = \sigma(z) \approx y\) and \(\sigma(z) = \frac{1}{1 + e^{-z}}\)
See the graph in R
Sigmoid <- function(x) {
1 / (1 + exp(-x))
}
# feed with data
x <- seq(-5, 5, 0.01)
# and plot
plot(x, Sigmoid(x), col='blue', ylim = c(-.2, 1))
abline(h = 0, v = 0, col = "gray60")
The idea is the following:
If z is very large \(\sigma(z) \approx 1\), but if z is a large negative number \(\sigma(z) \approx 0\)
As \(\hat{y} \approx y\), we have a loss or error function:
The logistic regression cost function is the following:
\(\ell(y, \hat{y}) = -(ylog(\hat{y}) + (1 - y)log(1 - \hat{y})\)
Why this loss function?
Because if \(y = 0\) we want \(\hat{y}\) small, if \(y = 1\) we want a large value for \(\hat{y}\). In another case, we want to penalize the result.
The cost function aggregates the loss function result for every \((x_{i}, y_{i})\). Given n, the sample size, the cost function is: \(\jmath(w, b) = \frac{1}{n}\sum_{1}^{n}\ell(y^{i} \hat{y}^{i}) = -\frac{1}{n}\sum_{1}^{n}((y^{i}log(\hat{y}^{i}) + (y^{i} +1)log(1 - \hat{y}^{i}))\)
Remember: \(\hat{y} = h_{\theta} = \sigma(w^Tx + b)\)
# Ref: https://www.r-bloggers.com/logistic-regression-with-r-step-by-step-implementation-part-2/
# Cost Function
#
CostFunction <- function(parameters, X, Y) {
n <- nrow(X)
# function to apply (%*% Matrix multiplication)
g <- Sigmoid(X %*% parameters)
J <- (1/n) * sum((-Y * log(g)) - ((1 - Y) * log(1 - g)))
return(J)
}
So, our mission is to find w, and b, such that minimizes \(\jmath(w, b)\)
To find maximums and minimums we use derivatives. And here the gradient descent appears (Raschka (2017)).
And, why derivatives?
Use the chain rule to obtain the derivatives.
The learning rate is paramount. If you set a very high learning rate the algorithm might not converge. On the contrary, with a very low learning rate the learning process could be slow (Magdon-Ismail (2017)).
Due to this kind of choices, experience is a high valueble assset (and Data Science become an Art in something aspects)
The algorithm is the following (represents the learning rate):
Repeat until convergence:
\(w = w - \alpha\frac{\partial\jmath(w, b)}{\partial w}\)
\(b = b - \alpha\frac{\partial\jmath(w, b)}{\partial b}\)
end algorithm
Then, remember we want to find W and b that minimizes the cost function not the \(\hat{y}\) function.
\[ \frac{\partial\jmath(w, b)}{\partial \theta_j} = \frac{1}{n} \sum\limits_{i=1}^n (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)} \] (Fok (2012))
where \(h_{\theta} = \sigma\)
As \(\frac{1}{n}\) is a constant, \(\alpha\frac{1}{n}\) is constant too.
Other way to see this, is the following. Consider the algorithm as:
Repeat until convergence:
\(w = w - \alpha\frac{\partial\ell(y, \hat{y})}{\partial w}\)
\(b = b - \alpha\frac{\partial\ell(y, \hat{y})}{\partial b}\)
end algorithm
NOTE (trick for the derivatives, using the chain rule):
Let \(a\) be the sigmoid functions, such that:
\(a = \sigma(z)\), \(-e^{-x} = 1 - (1 + e^{-x}) = 1 - \frac{1}{a} = \frac{a-1}{a}\)
An algorithm that always terminates with a desired solution in a finite number of iterations is a finite algorithm.
An iterative algorithm is said to converge when as the iterations proceed the output gets closer and closer to a specific value. Convergence to a local minimum can only be guaranteed when the function function is convex.
We can set up a stopping criterion, such as a finite number of iterations or a required improvement in the cost function.
Iterative algorithms are the core of Machine Learning, and for this reason, parallel computing is so important in Machine Learning and Data Science.
Suppose that you are an administrator of a university and you want to know the chance of admission of each applicant based on their two exams. You have historical data from previous applicants which can be used as the training data for logistic regression. Your task is to build a classification model that estimates each applicant’s probability of admission in university (Source:coursera machine learning class and Gondaliya (2013)).
Now, we have understood classification problem that we are going to address. Let us understand the data. In data we have records of previous applicants’ two exams score and label whether applicant got admission or not (1 - if got admission 0 - otherwise).
Analyze your data
#Load data
data <- read.csv("data/4_1_data.csv")
#Create plot
plot(data$score.1, data$score.2, col = as.factor(data$label), xlab = "Score-1", ylab = "Score-2")
In this case we have two predictor variables (two exams’ scores) and label as response variable. The equation is the following:
\(\hat{y} = \sigma(w_1x_1 + w_2*x_2 + b)\)
Then, remember we want to find W and b that minimizes the cost function not the \(\hat{y}\) function.
Let us set predictor and response variables.
#Predictor variables
X <- as.matrix(data[, c(1,2)])
#Add ones to X in the first column (matrix multiplication x b)
X <- cbind(rep(1, nrow(X)), X)
#Response variable
Y <- as.matrix(data$label)
#Intial parameters
initial_parameters <- rep(0, ncol(X))
#Cost at inital parameters
CostFunction(initial_parameters, X, Y)
## [1] 0.6931472
With each step of gradient descent, parameters come closer to the optimal values that will achieve the lowest cost. To do this we will set (i.e.) the iteration parameter to 1200
# We want to minimize the cost function. Then derivate this funcion
TestGradientDescent <- function(iterations = 1200, X, Y) {
# Initialize (b, W)
parameters <- rep(0, ncol(X))
# Check evolution
print(paste("Initial Cost Function value: ",
convergence <- c(CostFunction(parameters, X, Y)), sep = ""))
# updating (b, W) using gradient update
# Derive theta using gradient descent using optim function
# Look for information about the "optim" function (there are other options)
parameters_optimization <- optim(par = parameters, fn = CostFunction, X = X, Y = Y,
control = list(maxit = iterations))
#set parameters
parameters <- parameters_optimization$par
# Check evolution
print(paste("Final Cost Function value: ",
convergence <- c(CostFunction(parameters, X, Y)), sep = ""))
return(parameters)
}
# How to use
parameters <- TestGradientDescent(X = X, Y = Y)
## [1] "Initial Cost Function value: 0.693147180559945"
## [1] "Final Cost Function value: 0.203497704580909"
# probability of admission for student (1 = b, for the calculos)
new_student <- c(1,25,78)
print("Probability of admission for student:")
## [1] "Probability of admission for student:"
print(prob_new_student <- Sigmoid(t(new_student) %*% parameters))
## [,1]
## [1,] 0.01350584
Is he out? :-(
Now, We’re gonna to try other option
if (!require("gradDescent")) install.packages("gradDescent")
## Loading required package: gradDescent
# load
library("gradDescent")
New Gradient Descent version
# We want to minimize the cost function. Then derivate this funcion
TestGradientDescent2 <- function(iterations = 1200, learning_rate = 0.25, the_data) {
# label in the last column in dataSet
model <- gradDescentR.learn(dataSet = the_data, featureScaling = TRUE, scalingMethod = "VARIANCE",
learningMethod = "GD", control = list(alpha = learning_rate, maxIter = iterations),
seed = 1234)
model
}
# How to use
TestGradientDescent2(the_data = data)
## $featureScaling
## [1] TRUE
##
## $scalingMethod
## [1] "VARIANCE"
##
## $scalingParameter
## [,1] [,2] [,3]
## [1,] 65.64427 66.222 0.6
## [2,] 19.45822 18.58278 0.492366
##
## $learningMethod
## [1] "GD"
##
## $model
## [,1] [,2] [,3]
## [1,] 1.666426e-16 0.5865089 0.5262028
##
## $control
## $control$alpha
## [1] 0.25
##
## $control$maxIter
## [1] 1200
##
## $control$momentum
## [1] 0.9
##
## $control$nBatch
## [1] 2
##
## $control$lamda
## [1] 0
##
## $control$innerIter
## [1] 10
##
## $control$option
## [1] 2
##
## $control$gammaS
## [1] 0.125
##
##
## attr(,"class")
## [1] "gradDescentRObject"
# Now, the exercises. Use training and test set, change the value of alpha...
Unit tests are small functions that test your code and help you make sure everything is alright. To do this in R, the testthat package can be used as follows:
# Install if not installed
if (!require("testthat")) install.packages("testthat")
## Loading required package: testthat
# load
library(testthat)
Now, we can check the code for the TestGradientDescent function.
test_that("Test TestGradientDescent",{
parameters <- TestGradientDescent(X = X, Y = Y)
# probability of admission for student (1 = b, for the calculos)
new_student <- c(1,25,78)
prob_new_student <- Sigmoid(t(new_student) %*% parameters)
print(prob_new_student)
expect_equal(as.numeric(round(prob_new_student, digits = 4)), 0.0135)
# Fail, test
# expect_equal(as.numeric(round(prob_new_student, digits = 4)), 0.0130)
})
## [1] "Initial Cost Function value: 0.693147180559945"
## [1] "Final Cost Function value: 0.203497704580909"
## [,1]
## [1,] 0.01350584
set.seed(1234)
parameters <- TestGradientDescent(X = X, Y = Y)
## [1] "Initial Cost Function value: 0.693147180559945"
## [1] "Final Cost Function value: 0.203497704580909"
predicted <- Sigmoid(parameters %*% t(X))
predicted[predicted >= 0.9] <- 1
predicted[predicted < 0.9] <- 0
as.vector(predicted)
## [1] 0 0 0 1 1 0 1 0 1 0 1 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0
## [36] 0 0 1 0 0 1 0 1 0 0 0 1 1 1 1 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 1 0
## [71] 0 1 1 0 1 1 1 0 0 0 1 1 0 0 1 1 0 1 1 0 1 1 0 1 1 0 0 1 0 1
real_cases <- as.vector(Y)
library(caret)
confusionMatrix(data = as.factor(predicted) , reference = as.factor(real_cases), dnn = c("Predicted", "Real cases"))
## Confusion Matrix and Statistics
##
## Real cases
## Predicted 0 1
## 0 39 16
## 1 1 44
##
## Accuracy : 0.83
## 95% CI : (0.7418, 0.8977)
## No Information Rate : 0.6
## P-Value [Acc > NIR] : 6.256e-07
##
## Kappa : 0.6667
##
## Mcnemar's Test P-Value : 0.000685
##
## Sensitivity : 0.9750
## Specificity : 0.7333
## Pos Pred Value : 0.7091
## Neg Pred Value : 0.9778
## Prevalence : 0.4000
## Detection Rate : 0.3900
## Detection Prevalence : 0.5500
## Balanced Accuracy : 0.8542
##
## 'Positive' Class : 0
##
# We want to minimize the cost function. Then derivate this funcion
total_iterations <- seq(1:1200) # number of iterations
vector_costs <- rep(0, length(total_iterations)) # vector of zeros to fill with costs
TestGradientDescent3 <- function(iterations, X, Y) {
for (iterations in total_iterations) {
# Initialize (b, W)
parameters <- rep(0, ncol(X))
# Derive theta using gradient descent using optim function
parameters_optimization <- optim(par = parameters, fn = CostFunction,
X = X, Y = Y,
control = list(maxit = iterations))
#set parameters
parameters <- parameters_optimization$par
vector_costs[iterations] <- c(CostFunction(parameters, X, Y))
}
return(vector_costs)
}
# the new vector with all the costs
all_costs <- TestGradientDescent3(length(total_iterations), X, Y)
library(ggplot2)
# prepare the data to visualizate
final_matrix <- as.data.frame(cbind(total_iterations, all_costs))
ggplot(final_matrix,
aes(x = final_matrix$total_iterations,
y = final_matrix$all_costs)) +
geom_line() + xlab('Iterations') + ylab('Cost')
TestGradientDescentMethods <- function(iterations = 1200, X, Y, method) {
# Initialize (b, W)
parameters <- rep(0, ncol(X))
# updating (b, W) using gradient update
# Derive theta using gradient descent using optim function
parameters_optimization <- optim(par = parameters, fn = CostFunction,
X = X, Y = Y,
control = list(maxit = iterations),
method = method)
#set parameters
parameters <- parameters_optimization$par
return(parameters)
}
3.1 Method “Nelder-Mead”
TestGradientDescentMethods(length(total_iterations), X, Y, "Nelder-Mead")
## [1] -25.1647048 0.2062524 0.2015046
3.2 Method “BFGS”
TestGradientDescentMethods(length(total_iterations), X, Y, "BFGS")
## [1] -25.1857467 0.2064298 0.2016679
3.3 Method “CG”
TestGradientDescentMethods(length(total_iterations), X, Y, "CG")
## [1] -0.091867806 0.011086754 0.001181676
3.4 Method “L-BFGS-B”
The function optim can´t be used with method “L-BFGS_B” in this case.
3.5 Method “SANN”
TestGradientDescentMethods(length(total_iterations), X, Y, "SANN")
## [1] -2.02560554 0.02820956 0.02279036
3.6 Method “Brent”
The function optim can´t be used with method “Brent” in this case.
Explain why is not a trivial task to calculate the Gradient Descent. What happens if we have a very high dimensional problem?
Optional (+0.5 - 1 points in final grade).
Explain your work results using the RMarkdown format.
Some ideas from (recommended course) https://www.coursera.org/specializations/deep-learning
Fok, Sam. 2012. “Derivate Cost Function for Logistic Regression.” http://sambfok.blogspot.com.es/2012/08/partial-derivative-logistic-regression.html.
Gondaliya, Amar. 2013. “Logistic Regression with R.” https://www.r-bloggers.com/logistic-regression-with-r-step-by-step-implementation-part-2/.
Magdon-Ismail, M. 2017. “Logistic Regression and Gradient Descent.” http://www.cs.rpi.edu/~magdon/courses/LFD-Slides/SlidesLect09.pdf.
Raschka, Martin. 2017. “Machine Learning FAQ.” https://sebastianraschka.com/faq/docs/closed-form-vs-gd.html.
Wikipedia - Binary Classification. 2017. “Binary Classification.” https://en.wikipedia.org/wiki/Binary_classification.
Wikipedia - Logistic Regression. 2017. “Logistic Regression.” https://en.wikipedia.org/wiki/Logistic_regression.